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Abstract —In this paper we develop a Model Predictive Path 
Integral (MPPI) control algorithm based on a generalized 
importance sampling scheme and perform parallel optimization 
via sampling using a Graphics Processing Unit (GPU). The 
proposed generalized importance sampling scheme allows for 
changes in the drift and diffusion terms of stochastic diffusion 
processes and plays a significant role in the performance of the 
model predictive control algorithm. We compare the proposed 
algorithm in simulation with a model predictive control version 
of differential dynamic programming. 

1. INTRODUCTION 

The path integral optimal control framework [7], [15], 

[16] provides a mathematically sound methodology for de¬ 
veloping optimal control algorithms based on stochastic 
sampling of trajectories. The key idea in this framework is 
that the value function for the optimal control problem is 
transformed using the Feynman-Kac lemma [2], [8] into an 
expectation over all possible trajectories, which is known 
as a path integral. This transformation allows stochastic 
optimal control problems to be solved with a Monte-Carlo 
approximation using forward sampling of stochastic diffusion 
processes. 

There have been a variety of algorithms developed in the 
path integral control setting. The most straight-forward appli¬ 
cation of path integral control is when the iterative feedback 
control law suggested in [15] is implemented in its open 
loop formulation. This requires that sampling takes place 
only from the initial state of the optimal control problem. 
A more effective approach is to use the path integral control 
framework to find the parameters of a feedback control 
policy. This can be done by sampling in policy parameter 
space, these methods are known as Policy Improvement 
with Path Integrals [14]. Another approach to finding the 
parameters of a policy is to attempt to directly sample from 
the optimal distribution defined by the value function [3]. 
Other methods along similar threads of research include [10], 

[17] . 

Another way that the path integral control framework 
can be applied is in a model predictive control setting. 
In this setting an open-loop control sequence is constantly 
optimized in the background while the machine is simulta¬ 
neously executing the “best guess” that the controller has. 
An issue with this approach is that many trajectories must 
be sampled in real-time, which is difficult when the system 
has complex dynamics. One way around this problem is to 
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drastically simplify the system under consideration by using 
a hierarchical scheme [4], and use path integral control to 
generate trajectories for a point mass which is then followed 
by a low level controller. Even though this approach may be 
successful! for certain applications, it is limited in the kinds 
of behaviors that it can generate since it does not consider the 
full non-linearity of dynamics. A more efficient approach is 
to take advantage of the parallel nature of sampling and use 
a graphics processing unit (GPU) [19] to sample thousands 
of trajectories from the nonlinear dynamics. 

A major issue in the path integral control framework is 
that the expectation is taken with respect to the uncontrolled 
dynamics of the system. This is problematic since the proba¬ 
bility of sampling a low cost trajectory using the uncontrolled 
dynamics is typically very low. This problem becomes more 
drastic when the underlying dynamics are nonlinear and 
sampled trajectories can become trapped in undesirable parts 
of the state space. It has previously been demonstrated 
how to change the mean of the sampling distribution using 
Girsanov’s theorem [15], [16], this can then be used to 
develop an iterative algorithm. However, the variance of 
the sampling distribution has always remained unchanged. 
Although in some simple simulated scenarios changing the 
variance is not necessary, in many cases the natural variance 
of a system will be too low to produce useful deviations from 
the current trajectory. Previous methods have either dealt 
with this problem by artificially adding noise into the system 
and then optimizing the noisy system [10], [14]. Or they 
have simply ignored the problem entirely and sampled from 
whatever distribution worked best [12], [19]. Although these 
approaches can be successful, both are problematic in that 
the optimization either takes place with respect to the wrong 
system or the resulting algorithm ignores the theoretical basis 
of path integral control. 

The approach we take here generalizes these approaches in 
that it enables for both the mean and variance of the sampling 
distribution to be changed by the control designer, without 
violating the underlying assumptions made in the path inte¬ 
gral derivation. This enables the algorithm to converge fast 
enough that it can be applied in a model predictive control 
setting. After deriving the model predictive path integral 
control (MPPI) algorithm, we compare it with an existing 
model predictive control formulation based on differential 
dynamic programming (DDP) [6], [13], [18]. DDP is one of 
the most powerful techniques for trajectory optimization, it 
relies on a first or second order approximation of the dynam¬ 
ics and a quadratic approximation of the cost along a nominal 
trajectory, it then computes a second order approximation of 



the value function which it uses to generate the control. 

II. PATH INTEGRAL CONTROL 

In this section we review the path integral optimal control 
framework [7]. Let xj G denote the state of a dynamical 
system at time t, u(x(,f) G M"* denotes a control input for 
the system, r : [to,T] —>■ K" represents a trajectory of the 
system, and dw G is a brownian disturbance. In the path 
integral control framework we suppose that the dynamics 
take the form: 


dx = i{xt,t)dt + G(xt,f)u(xt,f)df + B(xt,f)dw (1) 


In other words, the dynamics are affine in control and subject 
to an affine brownian disturbance. We also assume that G 
and B are partitioned as: 

Expectations taken with respect to Q are denoted as Eq[-], 
we will also be interested in taking expectations with respect 
to the uncontrolled dynamics of the system (i.e 0 with 
u = 0). These will be denoted Ep[-]. We suppose that 
the cost function for the optimal control problem has a 
quadratic control cost and an arbitrary state-dependent cost. 
Let (^(xt’) denote a final the terminal cost, q{xt,t) a state 
dependent running cost, and define R(xt,f) as a positive 
definite matrix. The value function V{xt,t) for this optimal 
control problem is then defined as: 


min Eq 

U 


4>{^t) + 



q(xt,f) + -u^R(xt,f)u )df 


(3) 

The Stochastic Hamilton-!acobi-Bellman equation [1], [11] 
for the type of system in 0 and for the cost function in 0 
is given as: 


-dtV = g(xt,f)-f f(xt,f)'^I4 

- (4) 

-f ^tr(B(xt,f)B(xt,f)'^Lxx) 
where the optimal control is expressed as: 


this transformation we apply an exponential transformation 
of the value function 

L(x,f) =-Alog(T'(x,f)) (6) 

Here A is a positive constant. We also have to assume a 
relationship between the cost and noise in the system (as 
well as A) through the equation: 

Bc(xt,f)Bc(x,f)'^ = AGc(xt,f)R(xt,f)“^Gc(xt,f)'^ (7) 

The main restriction implied by this assumption is that 
B(xt,f) has the same rank as R(xt,f). This limits the 
noise in the system to only effect state variables that are 
directly actuated (i.e. the noise is control dependent). There 
are a wide variety of systems which naturally fall into this 
description, so the assumption is not too restrictive. However, 
there are interesting systems for which this description does 
not hold (i.e. if there are known strong disturbances on 
indirectly actuated state variables or if the dynamics are only 
partially known). 

By making this assumption and performing the exponen¬ 
tial transformation of the value function the stochastic HJB 
equation is transformed into the linear partial differential 
equation: 

dt'^ - - ltr(S(xt,;)'I'xx) 

^ ^ ( 8 ) 

Here we’ve denoted the covariance matrix 

Bc(xt,f)Bc(xt,f)'^ 

as S(xi,f). This equation is known as the backward 
Chapman-Kolmogorov PDE. We can then apply the 
Leynman-Kac lemma, which relates backward PDEs of this 
type to path integrals through the equation: 

q{x,t) df j 4'(xt,T) 

(9) 

Note that the expectation (which is the path integral) is 
taken with respect to P which is the uncontrolled dynamics 
of the system. By recognizing that the term 4 '(x 7’) is the 
transformed terminal cost: we can re-write this 

expression as: 


T'(xto,fo) = Ep 


exp 


1 


to 


u* = -R(xt,f) ^G(xt,t)'^14 (5) 

The solution to this backwards PDE yields the value function 
for the stochastic optimal control problem, which is then 
used to generate the optimal control. Unfortunately, classical 
methods for solving partial differential equations of this 
nature suffer from the curse of dimensionality and are 
intractable for systems with more than a few state variables. 

The approach we take in the path integral control frame¬ 
work is to transform the backwards PDE into a path integral, 
which is an expectation over all possible trajectories of the 
system. This expectation can then be approximated by for¬ 
ward sampling of the stochastic dynamics. In order to effect 


T'(xto,fo) « Ep 


exp 



( 10 ) 


where S{t) = Q(?^t,t)dt is the cost-to-go of 

the state dependent cost of a trajectory. Lastly we have to 
compute the gradient of T' with respect to the initial state x^g. 
This can be done analytically and is a straightforward, albeit 
lengthy, computation so we omit it and refer the interested 
reader to [14]. After taking the gradient we obtain: 

. ,_i Ep[exp(-A^(T))B(xtg,fo)dw] 
u dt y(Xtg,to) t 1 of \\1 

Ep[ exp(-^5'(T))J 


( 11 ) 










Where the matrix Q{xt,t) is defined as: 


R(xt,i) (Gc(xt,t)R(xt,i) ^Gc(xt,t)'^) ^ 

( 12 ) 

Note that if Gc(x(,f) is square (which is the case if the 
system is not over actuated) this reduces to Gc(xt,f)“^. 


Equation (111 is the path integral form of the optimal 


control. The fundamental difference between this form of 
the optimal control and classical optimal control theory is 
that instead of relying on a backwards in time process, 
this formula requires the evaluation of an expectation which 
can be approximated using forward sampling of stochastic 
differential equations. 


In order to efficiently approximate the controls, we require 
the ability to sample from a distribution which is likely to 
produce low cost trajectories. In previous applications of 
path integral control [15], [16] the mean of the sampling 
distribution has been changed which allows for an iterative 
update law. However, the variance of the sampling distri¬ 
bution has always remained unchanged. In well engineered 
systems, where the natural variance of the system is very 
low, changing the mean is insufficient since the state space 
is never aggressively explored. In the following derivation 
we provide a method for changing both the initial control 
input and the variance of the sampling distribution. 

A. Likelihood Ratio 


A. Discrete Approximation 

Equation o provides an expression for the optimal 
control in terms of a path integral. However, these equations 
are for continuous time and in order to sample trajectories 
on a computer we need discrete time approximations. 

We first discretize the dynamics of the system. We have 
that Xj+i = xt + dxt where dxj is defined as: 

dxt = (f(xt,f) + G(xt,f)u(xt,f)) At + B(xt,t)eVAt 

(13) 

The term e is a vector of standard normal Gaussian random 
variables. Eor the uncontrolled dynamics of the system we 
have: 

dxt = f(xt,t)At + B(xt,t)ev/^ (14) 

Another way we can express B(x 4 ,t)dw which will be 
useful is as: 


B(xt,t)dw « dxt - f(xt,t)At (15) 


Lastly we say: S{t) « +J2f=o where N = 

(T — t)/At Then by defining p as the probability induced by 
the discrete time uncontrolled dynamics we can approximate 
( [TT] i as: 


u* = g{xto,toy 


Er 


exp(-x<S(r)) ( 


^ - f(xto,to) 


Ep[ exp(-i5'(r))] 


(16) 


Note that we have moved the At term multiplying u over 
to the right-hand side of the equation and inserted it into the 
expectation. 


HI. GENERALIZED IMPORTANCE SAMPLING 

Equation provides an implementable method for 

approximating the optimal control via random sampling of 
trajectories. By drawing many samples from p the expecta¬ 
tion can be evaluated using a Monte-Carlo approximation. In 
practice, this approach is unlikely to succeed. The problem is 
that p is typically an inefficient distribution to sample from 
(i.e the cost-to-go will be high for most trajectores sampled 
from p). Intuitively sampling from the uncontrolled dynamics 
corresponds to turning a machine on and waiting for the 
natural noise in the system dynamics to produce interesting 
behavior. 


We suppose that we have a sampling distribution with non¬ 
zero control input and a changed variance, which we denote 
as q, and we would like to approximate ([T^ using samples 
from q as opposed to p. Now if we write the expectation 
term in integral form we get: 


/exp(-i5'(T)) -f(xt,f)) p(T)d 7 


(17) 


/exp(-iS'(T)) p(T)dT 

Where we are abusing notation and using r to represent the 
discrete trajectory (xj^, xt^,... Xt„). Next we multiply both 
integrals by 1 = to get: 


/exp(-iS'(T)) -f(xt,f)) ^p(r)dT 


(18) 


/exp(-i5(T)) ^p(T)dT 

And we can then write this as an expectation with respect 
to q: 


(19) 


Eq 

exp(-x5(r)) 


eM 

qp). 

Eq 

exp (- 




We now have the expectation in terms of a sampling distri¬ 
bution q for which we can choose: 

i) The initial control sequence from which to sample 
around. 

ii) The variance of the exploration noise which determines 
how aggressively the state space is explored. 

However, we now have an extra term to compute This is 
known as the likelihood ratio (or Radon-Nikodym derivative) 
between the distributions p and q. In order to derive an 
expression for this term we first have to derive equations 
for the probability density functions of p(t) and qjr) indi¬ 
vidually. We can do this by deriving the probability density 
function for the general discrete time diffusion processes 
P(t), corresponding to the dynamics: 

dxt = (f(xt,f) -f G(xt,f)u(xt,f)) At -f B(xt,t)eVAt 

( 20 ) 

The goal is to find P{t) = P(xtj,,xtj,.. .Xt^^). By condi¬ 
tioning and using the Markov property of the state space this 
probability becomes: 


N 


P(xto,Xt,,...Xt„) = ]jp(xtjxt,_j 


( 21 ) 













Now recall that a portion of the state space has deterministic 
dynamics and that we’ve partitioned the diffusion matrix as: 

We can partition the state variables x into the deterministic 
and non-deterministic variables and respectively. 
The next step is to condition on = F^°‘^ (x^, t) = Xj“^ + 
(f(“)(xt,f) + G(“)(xt,f)ut) dt since if this does not hold 
P{t) is zero. We thus need to compute: 

N 

(xtjx(,-j,xj“^ = (23) 

i=l 


And from the dynamics equations we know that each of these 
one-step transitions is Gaussian with mean: f(‘^^(xt,f) + 
ti)u(xt-,ti) and variance: 


Ei = Bc(xt,,fj)Bc(xt,,fj)'^Af. (24) 

We then define and /ii = 

Applying the definition of the Gaus¬ 
sian distribution with these terms yields: 


N 


P{r) = n 


exp (zi - ijiif ^ (zi - 


(27r)V2|E,|i/2 


(25) 


And then using basic rules of exponents this probability 
becomes: 

Z{T)-^exjp (26) 

Where Z{t) = With this equation 

in hand we’re now ready to compute the likelihood ratio 
between two diffusion processes. 


Theorem 1: Let p(r) be the probability density function 
for trajectories under the uncontrolled discrete time dynam¬ 
ics: 


dxt = f (xt, f) Af -I- B (xt, f) ev/^ (27) 


And let q(T) be the probability density function for trajecto¬ 
ries under the controlled dynamics with an adjusted variance: 


dxt = (f(xt, t) + G(xt, f)u(xt, t)) M+ 

B£:(xt,f)ev/Af ( 28 ) 
Where the adjusted variance has the form: 

And define z^, and as before. Let Qi be defined as: 

Qi — iZi Ti) Ti) ( 29 ) 

+ 2 (/Ti) Ej ^ (zi — /ii) -f Ej 


= (E-I-A^EzA.) 


-1 


(30) 


Then under the condition that each At- is invertible and each 
Li is invertible, the likelihood ratio for the two distributions 


is: 


N 


Vi=l 


exp 




(31) 


Proof: In discrete time the probability of a trajectory is 


formulated according to the (26 1 . We thus have p(t) equal 
to: 


p(r) = 

and g(r) equal to: 




(32) 


exp(^-^ ^(zz-M, 


(33) 


Z^ir) 

Then dividing these two equations we have as: 

(n (2^)n/2|S^|l/2) j 2 

Where Q is: 

0 = (z^Sj-^Zz - (Zz - (A^EzAt,) ^ (Zi - /Tz)) (35) 

Using basic rules of determinants it is easy to see that the 
term outside the exponent reduces to 


(27r)"/2|E,|i/2) 


(36) 


So we need only show that Q reduces to Qi. Observe that at 
every timestep we have the difference between two quadratic 
functions of z^, so we can complete the square to combine 
this into a single quadratic function. If we recall the definition 
of Ti from above, and define A^ = Aj/ZiAt- then completing 
the square yields: 


Ci — ^Ti) Pi ^ ^Ti) 

- m?a-Vz - (rzA-Vz)^r-i (TtA-Vz) 
Now we expand out the first quadratic term to get: 


(37) 


C» = z^Pz ^Zz + 2njA^ + fjjA^ ^TzAz Vi 

- /.^a-V z - (rzA-Vz)^r-^(rzA-Vz) 

Notice that the two underlined terms are the same, except 
for the sign, so they cancel out and we’re left with: 

Cz = z^Pz’^Zz -f 2/r^Az"^Zi - m?Az" Vz (39) 

Now define z^ = z^ — /r^, and then re-write this equation in 
terms of z^: 

Cz~(^zTA*z) Tz (^z TA^z) + 2/rj Aj (z^-f/Xi) —A^ 

(40) 


Where L^ is: 












which expands out to: 


Then by re-defining the running cost q{xt,t) as: 


Ci = z?r. + 2p^r. + fijr. V* 

+ + 2a<?A-V* - 

Which then simplifies to: 

C* = + 2fj,Jr-^Zi + 

+ ‘2tJ-JK^Zi + 


(41) 


(42) 


Now recall that T^ = (Sj — so we can split the 

^—1 V* —1 A—1 


into the S, ^ and A, ^ components. 
C = + 2/ri^S-^Zi - 2fif A-'^Zi + Vi 


quadratic terms in T^ 
Doing this yields: 


A^i Aj fJ-i + 2fj,^ Aj Zi + Aj 


(43) 


and by noting that the underlined terms cancel out we see 
that we’re left with: 


Ci ~ Zj Tj Zi + 2/ij Sj Zi + Ej /ii 


(44) 


which is the same as: 

(zi - (zi - /r,) -f 2^J.JE-^ {zi - ^l^) + M^E^Vi 

(45) 

And so Ci = Qi which completes the proof. ■ 

The key difference between this proof and earlier path inte¬ 
gral works which use an application of Girsanov’s theorem 
to sample from a non-zero control input is that this theorem 
allows for a change in the variance as well. 

In the expression for the likelihood ratio derived here 
the last two terms (2p,^E“^ (z^ — fj,i) + /i^E“^p,i) are ex¬ 
actly the terms from Girsanov’s theorem. The first term 
({zi — T~^ {zi — Hi)), which can be interpreted as pe¬ 

nalizing over-aggressive exploration, is the only additional 
term. 

B. Likelihood Ratio as Additional Running Cost 

The form of the likelihood ratio just derived is easily 
incorporated into the path integral control framework by 
folding it into the cost-to-go as an extra running cost. Note 
that the likelihood ratio appears in both the numerator and 


denominator of (16i. Therefore, any terms which do not 


depend on the state can be factored out of the expectation 
and canceled. This removes the numerically troublesome 
normalizing term n^iVtjl- So only the summation of Qi 
remains. Recall that E = AG(x(, f)R(xt, f)“^G(xt, f). This 
implies that: 


r = 


A (^(G(xt,f)R(xt,t) ^G(xt,f)) ^ 

- (A'^G(xt,f)R(xt,f)“^G(xt,f)'^A) 

Now define H = G(xt, f)R(xt, A)“^G(xt, and f = ^T. 
We then have: 

Q=t f(z-V 


A 


tTp-l 


z —^)-|-2^^H {z — h) + H^^ 

( 47 ) 


1 


g(x, u, dx) = q{Kt, t) + ^{z- nfr Vz - /i) 

2 (48) 

+ (z - m) + V 


and S{t) = we have: 


u. 


Eq 

exp 

4^(r))(^-f(x„f))' 

Eq 

exp 



(49) 

Also note that dxj is now equal to: 

(f(xt, t) + G(xt, f)u(xt, t)) At + B(xt, t)e\fAt (50) 
- f(xt,t) as: 


So we can re-write ^ 


G(xt,t)u(xt,f) -f B(xt,f) 


s/At 


(51) 


And then since G(xt,f) does not depend on the expectation 
we can pull it out and get the iterative update law: 


Ut=0(xt,f) ^G(xt,f)u(xt,f) 

+ g{yit,t)~^ 

C. Special Case 


Eq 

exp 

]^5(r)) B(xt,f)^ 

Eq 

exp 



(52) 


The update law (52 1 is applicable for a very general class 
of systems. In this section we examine a special case which 
we use for all of our experiments. We consider dynamics of 
the form: 

dxt = i{xt,t)At + G(xt, t) ^u(xt, f)Af -I- 

(53) 

And for the sampling distribution we set A equal to y/ul. 
We also assume that Gc(xt,f) is a square invertible matrix. 
This reduces 'R(x(,f) to Gc(xt,f)“^. Next the dynamics 
can be re-written as: 

dxt = f{xt,t)At + G(xt,f) ^u(xt,f) -f 

^ ^ (54) 

Then we can interpret as a random change in the 

control input, to emphasize this we will denote this term as 
We then have B(xt,f)^ = G(xt,f)(5u. 
This yields the iterative update law as: 


u(xt,f)* = u(xt,f) -f 


E„ 


exp (-i'^(V) 


i5u 


En 


exp - 


(-A^W) 


(55) 


which can be approximated as: 


u{xt,,U)* « u(xt,,fi) -f 


Ef=iexp 

Ef=iexp(-4^(Ti,fc)) 


( 56 ) 



























Where K is the number of random samples (termed rollouts) 
and S{Ti^k) is the cost-to-go of the ktu rollout from time ti 
onward. This expression is simply a reward-weighted average 
of random variations in the control input. Next we investigate 
what the likelihood ratio addition to the running cost is. For 
these dynamics we have the following simplifications: 

i) z —/i = G(xt, f)^u 

ii) f= (1 - f)“^R(xt, t)G(xt, t) 

hi) H-i = G(xt,t)-iR(xt,t)G(xi,f)-i 

Given these simplifications q reduces to: 

g(x,u,dx) = g(xt,f) + ^ —^Ju'^RJu 

^ (j /) 

-f u^R(5u -t- -u^Ru 

This means that the introduction of the likelihood ratio 
simply introduces the original control cost from the optimal 
control formulation into the sampling cost, which originally 
only included state-dependent terms. 

IV. MODEL PREDICTIVE CONTROL ALGORITHM 

We apply the iterative path integral control update law, 
with the generalized importance sampling term, in a model 
predictive control setting. In this setting optimization and 
execution occur simultaneously: the trajectory is optimized 
and then a single control is executed, then the trajectory is 
re-optimized using the un-executed portion of the previous 
trajectory to warm-start the optimization. This scheme has 
two key requirements: 

i) Rapid convergence to a good control input. 

ii) The ability to sample a large number of trajectories in 
real-time. 

The first requirement is essential because the algorithm 
does not have the luxury of waiting until the trajectory has 
converged before executing. The new importance sampling 
term enables tuning of the exploration variance which allows 
for rapid convergence, this is demonstrated in Eig. [T] 

The second requirement, sampling a large number of 
trajectories in real-time, is satisfied by implementing the 
random sampling of trajectories on a GPU. The algorithm 
is given in Algorithm [T] in the parallel GPU implementation 
the sampling for loop (for k to K-1) is run completely in 
parallel. 


V. EXPERIMENTS 

We tested the model predictive path integral control algo¬ 
rithm (MPPI) on three simulated platforms (1) A cart-pole, 
(2) A miniature race car, and (3) A quadrotor attempting 
to navigate an obstacle filled environment. Eor the race car 
and quadrotor we used a model predictive control version 
of the differential dynamic programming (DDP) algorithm 
as a baseline comparision. In all of these experiments the 
controller operates at 50 Hz, this means that the open loop 
control sequence is re-optimized every 20 milliseconds. 


Algorithm 1; Model Predictive Path Integral Control 

Given: K: Number of samples; 

N\ Number of timesteps; 

(uq, Ui, ...UAr_i): Initial control sequence; 
At,xto,f,G,B, V. System/sampling dynamics; 

(f), q, R, A: Cost parameters; 

Uinit: Value to initialize new controls to; 

while task not completed do 
for fc •(— 0 to AT — 1 do 

X = xt„; 

for i ^ 1 to A — 1 do 

Xi+i = Xi -f (f -f G (Ui -f Af; 
S{Ti+i,k) = S{n^k) + g; 


for I ^ 0 to A — 1 do 


Ui ^ Ui 


'T 

l^k=l 


exp(-JS(Ti,fc))i5ui,fc \ 
ELiexp(-iS(Ti,fc)) J 


send to actuators (uq); 

for i ^ 0 to A — 2 do 

|_ Ui = Ui+i; 

UaT-I = Uinit 

Update the current state after receiving feedback; 
check for task completion; 


A. Cart-Pole 

Eor the cart-pole swing-up task we used the state cost: 
g(x) = -f 500(1 -f cos(0))^ p^, where p is the 

position of cart, p is the velocity and 9, 9 are the angle and 
angular velocity of the pole. The control input is desired 
velocity, which maps to velocity through the equation: p = 
10{u — p). The disturbance parameter ^ was set equal .01 
and the control cost was R = 1. We ran the MPPI controller 
for 10 seconds with a 1 second optimization horizon. The 
controller has to swing-up the pole and keep it balanced for 
the rest of the 10 second horizon. The exploration variance 
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Fig. 1. Average running cost for the cart-pole SAving-up task as a function of 
the exploration variance u and the number of rollouts. Using only the natural 
system vaiiance the MFC algorithm does not converge in this scenario. 
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parameter, v, was varied between 1 and 1500. The M 
controller is able to swing-up the pole faster with increai 
exploration variance. Fig.[T]illustrates the performance of 
MPPl controller as the exploration variance and the nun 
of rollouts are changed. Using only the natural varianci 
the system for exploration is insufficient in this task, in uiai 
case (not shown in the figure) the controller is never able to 
swing-up the pole which results in a cost around 2000. 

B. Race Car 



MPC-DDP 


MPPI 
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Fig. 3. Comparison of DDP (left) and MPPI (right) performing a cornering 
maneuver along an ellipsoid track. MPPI is able to make a much tigther 
turn while carrying more speed in and out of the corner than DDP. The 
direction of travel is counterclockwise. 


In the race car task the goal was to minimize the objective 
function; g(x) = lOOd^ -I- {vx — 7.0)^. Where d is defined 
as: d = l(^) + (I) — 1|, and Vx is the forward (in 
body frame) velocity of the car. This cost ensures that 
the car to stays on an elliptical track while maintaining a 
forward speed of 7 meters/sec. We use a non-linear dynamics 
model [5] which takes into account the (highly non-linear) 
interactions between tires and the ground. The exploration 
variance was set to a constant v times the natural variance 
of the system. The MPPI controller is able to enter turns at 



Fig. 2. Performance comparison in terms of average cost between MPPI 
and MPC-DDP as the exploration variance u changes from 50 to 300 and 
the number of rollouts changes from 10 to 1000. Only with a very large 
increase in the exploration variance is MPPI able to outperform MPC-DDP. 
Note that the cost is capped at 25.0 

close to the desired speed of 7 m/s and then slide through 
the turn. The DDP solution does not attempt to slide and 
significantly reduces its forward velocity before entering the 
turn, this results in a higher average cost compared to the 
MPPI controller. Fig. shows the cost comparison between 
MPPI and MPC-DDP, and Figures and show samples of 
the trajectories taken by the two algorithms as well as the 
velocity prohles. 

C. Quadrotor 

The quadrotor task was to fly through a held filled 
with cylindrical obstacles as fast as possible. We used the 
quadrotor dynamics model from [9]. This is a non-linear 
model which includes position, velocity, euler angles, angular 
acceleration, and the rotor dynamics. We randomly generated 
three forests, one where obstacles are on average 3 meters 
apart, the second one 4 meters apart, and the third 5 meters 
apart. We then separately created cost functions for both 
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-- DDPKI 

MPPI \Vy\ 



Time (s) 


Fig. 4. Comparison of DDP (left) and MPPI (right) performing a cornering 
maneuver along an ellipsoid track. MPPI is able to make a much tigther 
turn while carrying more speed in and out of the comer than DDP. 

MPPI and DDP which guide the quadrotor through the forest 
as quickly as possible. The cost function for MPPI was 



Fig. 5. Left: sample DDP trajectory through 4m obstacle field, Right: 
Sample MPPI trajectory through the same field. Since the MPPI controller 
can directly reason about the shape of the obstacles it is able to safely pass 
through the field taking a much more direct route. 

of the form; g(x) = 2B{px - p't^f + 2.5(py - pf-h 
150(p^ - pf^f + -p ||^||2+350exp(-^) -f lOOOC 
where {px,Py,Pz) denotes the position of the vehicle, ip 
denotes the yaw angle in radians, v is velocity, and d is 
the distance to the closest obstacle. C is a variable which 
indicates whether the vehicle has crashed into the ground or 
an obstacle. Additionally if C = 1 (which indicates a crash), 
the rollout stops simulating the dynamics and the vehicle 
remains where it is for the rest of the time horizon. We found 
that the crash indicator term is not useful for the MPC-DDP 
based controller, this is not surprising since the discontinuity 



























it creates is difficult to approximate with a quadratic function. 
The term in the cost for avoiding obstacles in the MPC- 
DDP controller consists purely of a large exponential term: 
2000 exp(—note that this sum is over all the 
obstacles in the proximity of the vehicle whereas the MPPI 
controller only has to consider the closest obstacle. 
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Fig. 6. Time to navigate forest. Comparison between MMPI and DDR 

Since the MPPI controller can explicitly reason about 
crashing (as opposed to just staying away from obstacles), 
it is able to travel both faster and closer to obstacles than 
the MPC-DDP controller. Fig. |7] shows the difference in time 
between the two algorithms and Fig. the trajectories taken 
by MPC-DDP and one of the MPPI runs on the forest with 
obstacles placed on average 4 meters away. 



Fig. 7. Simulated forest environment used in the quadrotor navigation task. 

VI. CONCLUSION 

In this paper we have developed a model predictive path 
integral control algorithm which is able to outperform a 
state-of-the-art DDP method on two difficult control tasks. 
The algorithm is based on stochastic sampling of system 
trajectories and requires no derivatives of either the dynamics 
or costs of the system. This enables the algorithm to naturally 
take into account non-linear dynamics, such as a non-linear 
tire model [5]. It is also able to handle cost functions which 
are intuitively appealing, such as an impulse cost for hitting 
an obstacle, but are difficult for traditional approaches that 
rely on a smooth gradient signal to perform optimization. 
The two keys to achieving this level of performance with a 
sampling based method are: 

i) The derivation of the generalized likelihood ratio be¬ 
tween discrete time diffusion processes. 


ii) The use of a GPU to sample thousands of trajectories 
in real-time. 

The derivation of the likelihood ratio enables the designer 
of the algorithm to tune the exploration variance in the path 
integral control framework, whereas previous methods have 
only allowed for the mean of the distribution to be changed. 
Tuning the exploration variance is critical in achieving a high 
level of performance since the natural variance of the system 
is typically too low to achieve good performance. 

The experiments considered in this work only consider 
changing the variance by a constant multiple times the 
natural variance of the system. In this special case the 
introduction of the likelihood ratio corresponds to adding in 
a control cost when evaluating the cost-to-go of a trajectory. 
A direction for future research is to investigate how to 
automatically adjust the variance online. Doing so could 
enable the algorithm to switch from aggressively exploring 
the state space when performing aggressive maneuvers to 
exploring more conservatively for performing very precise 
maneuvers. 
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